# Vision-Language Pretraining
- **Blip Custom Captioning** — hiteshsatwani · BSD-3-Clause · Image-to-Text · 78 downloads · 0 likes
  BLIP is a unified vision-language pretraining framework that excels at vision-language tasks such as image caption generation.

- **Sail Clip Hendrix 10epochs** — cringgaard · Text-to-Image · Transformers · 49 downloads · 0 likes
  A vision-language model fine-tuned from openai/clip-vit-large-patch14 for 10 epochs.

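As a rough illustration of how CLIP-style checkpoints like the fine-tune above are typically used for zero-shot scoring: embeddings are L2-normalized, compared by dot product, scaled, and softmaxed. This is a minimal sketch with toy hand-written vectors standing in for real model outputs, not the model's actual API.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length before comparison."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Convert raw similarity scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def clip_scores(image_emb, text_embs, logit_scale=100.0):
    """Cosine similarity of one image embedding against several text
    embeddings, scaled and softmaxed -- the zero-shot scoring recipe
    popularized by CLIP. All inputs here are toy values."""
    img = l2_normalize(image_emb)
    sims = []
    for t in text_embs:
        t = l2_normalize(t)
        sims.append(logit_scale * sum(a * b for a, b in zip(img, t)))
    return softmax(sims)

# Toy 3-d embeddings; real CLIP embeddings are hundreds of dimensions.
image = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0],   # e.g. "a photo of a dog"
         [0.0, 1.0, 0.0]]   # e.g. "a photo of a cat"
probs = clip_scores(image, texts)
```

The temperature (`logit_scale`) sharpens the distribution; CLIP learns it during training rather than fixing it by hand.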
- **Vit So400m Patch14 Siglip 384.webli** — timm · Apache-2.0 · Image Classification · Transformers · 9,429 downloads · 0 likes
  Vision Transformer based on the SigLIP architecture, containing only the image encoder and using the original attention-pooling mechanism.

- **Vit Base Patch16 Siglip 512.webli** — timm · Apache-2.0 · Image Classification · Transformers · 702 downloads · 0 likes
  Vision Transformer based on the SigLIP architecture, containing only the image encoder and using the original attention-pooling mechanism.

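The "Patch16 … 512" naming above encodes the encoder's input geometry: a ViT cuts the image into non-overlapping square patches, and each patch becomes one token. A minimal sketch of that arithmetic (assuming the image side is a multiple of the patch size, as it is for the patch16/512 variant):

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT image encoder produces when the
    image is cut into non-overlapping patch_size x patch_size squares."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of patch size")
    per_side = image_size // patch_size
    return per_side * per_side

# For the 512px, patch-16 encoder above: a 32x32 grid of patches.
tokens = vit_token_count(512, 16)  # 1024 patch tokens
```

Those patch tokens are what the attention-pooling head then condenses into a single image embedding.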
- **Minivla Vq Bridge Prismatic** — Stanford-ILIAD · MIT · Image-to-Text · Transformers · English · 22 downloads · 0 likes
  MiniVLA is a more compact yet higher-performing vision-language-action model, compatible with the Prismatic VLMs project codebase.

- **Biomedclip ViT Patch16 224** — ikim-uk-essen · MIT · Multimodal Fusion · Transformers · 1,296 downloads · 3 likes
  BiomedCLIP is a biomedical vision-language model developed by Microsoft, built on PubMedBERT and a ViT image encoder and designed specifically for the biomedical domain.

- **Image Captioning With Blip** — Vidensogende · BSD-3-Clause · Image-to-Text · Transformers · 16 downloads · 0 likes
  BLIP is a unified vision-language pretraining framework that excels at image caption generation, supporting both conditional and unconditional text generation.

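The distinction between conditional and unconditional captioning mentioned above comes down to whether decoding starts from a user-supplied prompt prefix or from scratch. A toy sketch of that idea using a hand-written bigram table and greedy decoding (purely illustrative; BLIP itself decodes with a full transformer language model):

```python
# Toy bigram "language model": maps each word to its most likely successor.
BIGRAMS = {
    "<start>": "a",
    "a": "dog",
    "dog": "on",
    "on": "grass",
    "grass": "<end>",
    "photo": "of",
    "of": "a",
}

def greedy_caption(prompt=None, max_len=8):
    """Greedy decoding from the bigram table. With no prompt, generation
    starts from <start> (unconditional captioning); with a prompt, it
    continues from the prompt's last word (conditional captioning)."""
    words = list(prompt.split()) if prompt else []
    current = words[-1] if words else "<start>"
    while len(words) < max_len:
        nxt = BIGRAMS.get(current, "<end>")
        if nxt == "<end>":
            break
        words.append(nxt)
        current = nxt
    return " ".join(words)

unconditional = greedy_caption()       # starts fresh
conditional = greedy_caption("photo")  # continues the given prefix
```

In the real model the prompt is tokenized and prepended to the decoder input, but the control flow is the same: the prefix conditions everything generated after it.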
- **Vilt Finetuned 200** — Atul8827 · Apache-2.0 · Text-to-Image · Transformers · 35 downloads · 0 likes
  Vision-language model based on the ViLT architecture, fine-tuned for a specific downstream task.

- **Llava V1.5 Mlp2x 336px Pretrain Vicuna 7b V1.5** — liuhaotian · Text-to-Image · Transformers · 173 downloads · 17 likes
  LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

- **Image Caption Large Copy** — Sof22 · BSD-3-Clause · Image-to-Text · Transformers · 1,042 downloads · 10 likes
  BLIP is an advanced vision-language pretraining model that excels at image captioning by making effective use of noisy web data through its caption-bootstrapping strategy.

- **OTTER MPT7B Init** — luodian · MIT · Text-to-Image · Transformers · 53 downloads · 3 likes
  OTTER-MPT7B-Init is a set of weights for initializing Otter model training, converted directly from OpenFlamingo.

- **Blip Test** — mooncakex · BSD-3-Clause · Image-to-Text · Transformers · 15 downloads · 0 likes
  Image-caption generation model fine-tuned from Salesforce/blip-image-captioning-base.

- **Pix2struct Large** — google · Apache-2.0 · Image-to-Text · Transformers · Multilingual · 6,601 downloads · 34 likes
  Pix2Struct is an image-encoder/text-decoder model trained on image-text pairs, suitable for a variety of vision-language tasks.

- **Blip Image Captioning Base Football Finetuned** — ybelkada · BSD-3-Clause · Image-to-Text · Transformers · 71 downloads · 2 likes
  A vision-language model pre-trained on COCO and fine-tuned on a football dataset for generating image captions.